1.  Foreword

The main goal of this proposal is to develop novel representations of objects and environments for localization,
semantic mapping and target detection. The majority of research efforts in mapping and visual perception for
robotic systems have focused on localization, map building, or doing both jointly, known as the simultaneous
localization and mapping (SLAM) problem. The maps proposed in the past range from metric and topological to
hybrid representations of the environment. While these models are suitable for navigation tasks, endowing them
with additional semantic information can enable more complex tasks, such as object search or better
target/object detection, as well as more advanced interactions with humans. Semantic labeling techniques strive
to assign different semantic labels to different partitions of the data and use the context of indoor and outdoor
environments to improve the state of the art of existing visual localization strategies, contextual target detection
and recognition, and semantic mapping. The focus of our approach is on the development of unified representations
which can be adapted to the task at hand. These representations and the framework for learning and inference will
be an integral part of the perceptual capabilities of a robotic system and will be evaluated using different sensing
modalities and different tasks in indoor and outdoor environments.

2.  Problem  Statement 

The semantic mapping of the environment requires simultaneous segmentation and categorization of the acquired
stream of sensory information. Existing methods typically consider semantic mapping as the final goal and differ
in the number and types of semantic categories considered. We envision semantic understanding of the environment
as an on-going process and seek representations which can be refined and adapted depending on the task and the
robot's interaction with the environment. In this work we propose a novel and efficient method for semantic
parsing, which can be adapted to the task at hand and enables localization of objects of interest in indoor
environments. For basic mobility tasks we demonstrate how to obtain an initial semantic segmentation of the
scene into ground, structure, furniture and props categories, which constitute the first level of the hierarchy.
Then, we propose a simple and efficient method for predicting the locations of objects that, based on their size,
afford a manipulation task. In our experiments we use the publicly available NYU V2 dataset [8] and obtain results
better than or comparable to the state of the art at a fraction of the computational cost. We show the
generalization of our approach on two more publicly available datasets.

3.  Summary  of  the  results 

Over the duration of the project we have made several significant contributions in semantic understanding of
multimodal sensory data in indoor and outdoor environments. We briefly summarize them below; additional
details can be found in the accompanying publications.


Priming Object Detection In [2, 3] we demonstrated a very efficient algorithm for initial semantic
segmentation of the scene into ground, structure, furniture and props categories, which constitute the first level
of the hierarchy. The main technical insight was the use of a minimum weight spanning tree approximation of the
inference graph, computed on 3D depth data, together with features that are both effective and efficient to
compute. These choices enabled us to use well-conditioned exact inference techniques for the learning and
estimation of the final labeling, yielding improved performance at a fraction of the computational cost on the
standard NYU V2 RGB-D benchmark dataset. The initial version of this work was published in a workshop, followed
by submission and acceptance of the work to the International Journal of Robotics Research [3].

Recursive Semantic Labeling In the follow-up work we extended the static semantic parsing to a video
setting and proposed a recursive Bayes-filter-style updating mechanism [1]. In this problem we focused on
outdoor environments and exploited widely available exemplars of non-object categories (such as road, buildings,
vegetation), and used geometric cues which are indicative of the presence of object boundaries to gather
evidence about objects regardless of their category. We carried out extensive experiments on videos of urban
environments acquired by a moving vehicle and showed quantitatively and qualitatively the benefits of our proposal.
Another notable feature of the resulting approach was the close to real-time performance of the whole system (5 fps),
including the feature computation and inference.
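The recursive updating idea can be illustrated for a single node with a discrete Bayes filter over its semantic labels. This is only a minimal sketch, not the method of [1]: the function name `recursive_update`, the "label persists with probability `stay_prob`" transition model and the example likelihood values are all assumptions made for illustration.

```python
def recursive_update(prior, likelihood, stay_prob=0.9):
    """One predict-update step of a discrete Bayes filter over semantic labels.

    prior      : label probabilities from the previous frame (sums to 1)
    likelihood : evidence for each label from the current frame
    stay_prob  : assumed probability that a label persists between frames
    """
    n = len(prior)
    # Predict: a label persists with probability stay_prob, otherwise it
    # switches uniformly to one of the other labels (illustrative model).
    switch = (1.0 - stay_prob) / (n - 1)
    predicted = [stay_prob * prior[j] + switch * (1.0 - prior[j]) for j in range(n)]
    # Update: weight by the current-frame evidence and renormalize.
    unnorm = [p * l for p, l in zip(predicted, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two consecutive frames that both favour label 0 (e.g. "ground").
belief = [0.25, 0.25, 0.25, 0.25]
evidence = [0.7, 0.1, 0.1, 0.1]
belief_1 = recursive_update(belief, evidence)
belief_2 = recursive_update(belief_1, evidence)
```

Repeated consistent evidence sharpens the belief from frame to frame, which is the benefit a recursive formulation has over labeling each frame independently.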

Heterogeneous Coverage In the previous approaches we were able to compute the semantic labeling for
regions of the images and video using only one sensing modality, by incorrectly interpolating measurements of
other modalities, or at best by assigning semantic labels only to the spatial intersection of the coverages of the
different sensors. In this work, using the previously proposed strategy for inducing the graph structure of the
Conditional Random Field used for inference, we proposed a novel method for computing sensor-domain-dependent
potentials. This strategy enabled us to achieve superior semantic segmentation for the regions in the union of the
spatial coverage of the sensors, while keeping the computational cost of the approach low. The problem is
illustrated in Fig. 1. With an image sensor alone, note how in column (b) one portion of the car is confused with
the ground because their colors are similar. We demonstrated how to combine visual sensing with the evidence from
a 3D laser sensor and mitigate sensor-specific perceptual confusers, column (c), but then we are only able to
explain a subset of the scene, the spatial intersection of the coverages, leaving us without output for the car
glass and the building in the top portion of the image. With the strategy we introduced, we can take advantage of
both sensor modalities without discarding the non-overlapping zones, column (d) in Fig. 1.


Finer Grained Semantic Labeling The previous techniques are very efficient semantic labeling
techniques for a small number of semantic categories. This was possible due to efficient features and inference
algorithms. In order to obtain better discrimination capabilities for different categories, additional features
have to be computed and alternative inference algorithms used. We proposed to formulate multi-class object
recognition and segmentation in RGB-D data as many binary object-background segmentations, using an informative
set of features and grouping cues for small regular superpixels. The main novelty of the proposed approach is the
exploitation of informative depth channel features which indicate the presence of depth boundaries, the use of
efficient supervised object-specific binary segmentation, and effective hard negative mining exploiting object
co-occurrence statistics. The binary segmentation is meaningful in the context of robotics applications, where
often only the object of interest needs to be sought. This yields an efficient and flexible method, which can be
easily extended to additional object categories. We report the performance of the approach on the NYU-V2 indoors


(a)  (b)  (c)  (d) 

Figure 1. We propose in this work a new approach to semantic parsing, which can seamlessly integrate evidence from multiple
sensors with overlapping but possibly different fields of view and account for missing data, while predicting semantic labels
over the spatial union of the sensors' coverages. The semantic segmentation is formulated on a graph, in a manner which depends
on the sensing modality. First row: (a) over-segmentation of the image; (b) graph induced by the superpixels; (c) the 3D point
cloud re-projected on the image with a tree graph structure computed in 3D; and (d) the full graph as proposed here for full
scene understanding. Second row, semantic segmentation: (a) ground truth; (b) result using the image graph and only visual
information; (c) result using the 3D graph and visual and 3D information; and (d) result using a graph for full coverage and
all the information. Note the best semantic segmentation achieved over the union of the spatial coverage of both sensors.
Color code: ground, objects, building and vegetation.
[Category column headers not recoverable; values are listed in the original column order. [?] marks an illegible entry.]

Block 1
(row label lost)  ...  33.00  27.00  4.60
Ren [6]           42.00  28.00  33.00  17.00  28.00  17.00  19.00  1.20  7.80  27.00  15.00  3.30  37.00  9.50  39.00  28.00  10.00
Gupta [4]         55.00  44.00  40.00  30.00  33.00  20.00  9.30  0.65  33.00  44.00  4.40  4.80  48.00  6.90  47.00  34.00  10.00
Ours (unary)      50.64  37.44  25.00  19.19  25.93  23.88  26.40  3.28  32.12  29.77  9.17  2.89  27.42  9.79  34.68  25.59  21.04
Ours (CRF)        56.85  42.29  31.44  20.78  30.16  30.29  34.97  3.00  32.95  33.09  10.06  3.99  29.34  10.04  33.82  30.11  23.35

Block 2
Silberman [8]     5.90  13.00  7.20  16.00  4.40  6.30  13.00  6.60  36.00  19.00  1.40  3.30  3.60  25.00  27.00  0.11  0.00
Ren [6]           13.00  7.00  20.00  14.00  18.00  9.20  12.00  14.00  32.00  20.00  1.90  6.10  5.40  29.00  35.00  13.00  0.15
Gupta [4]         8.30  22.00  22.00  6.80  19.00  20.00  1.90  16.00  [?]  28.00  15.00  5.10  18.00  26.00  50.00  [?]  37.00
Ours (unary)      14.72  32.35  32.81  6.68  23.09  16.22  7.64  19.54  17.93  16.16  16.86  10.67  25.54  10.98  26.06  7.62  36.25
Ours (CRF)        17.16  35.73  34.19  12.14  27.41  21.54  10.07  30.31  22.21  22.98  20.59  13.46  26.84  11.04  38.65  8.61  37.69

Table  1.  Performance  on  the  NYUD-V2  dataset  in  Jaccard  index. 


dataset and demonstrate improvement in the global and average accuracy compared to state-of-the-art methods.
A brief summary of the results, highlighting the performance in the different categories, can be seen in
Table 1. More details of the proposed methods can be found in [7] and the follow-up submission, which
is currently under review.
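For reference, the Jaccard index reported in Table 1 is the intersection-over-union between predicted and ground-truth pixels of a class. The snippet below is a minimal sketch of that measure on flat label lists; the function name `jaccard_per_class` and the toy labels are illustrative and not taken from the evaluation code behind the table.

```python
def jaccard_per_class(pred, truth, num_classes):
    """Per-class Jaccard index |pred ∩ truth| / |pred ∪ truth| over pixels."""
    scores = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        # A class absent from both prediction and ground truth has no score.
        scores.append(inter / union if union else float("nan"))
    return scores


# Six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
scores = jaccard_per_class(pred, truth, num_classes=3)
```

Unlike recall, this measure also penalizes false positives, since pixels wrongly predicted as the class enlarge the union.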
Robotic  Perception


Methods  for  modeling  3D  geometry  of  the 
environment  and  semantic  concepts 

Seamless  processing  of  video  streams 

Recursive  and  multi-view  setting 

Efficient  Inference 


Robotic  Tasks 


Navigation 


Mapping 

Localization 


Manipulation 


Human Interaction


Representations 


Road, floor, free space,
landmarks, locations

Structural Obj. (doors, ...)

Specific Obj. (traffic signs, cups, ...)

Dynamic Obj., People (cars, bikes, ...)


Representations which can be computed efficiently,
are reusable for multiple tasks, and can be extended efficiently to new
semantic concepts


Tasks 


Mapping 

Localization 


Representations 


•  Visual odometry - linear algorithm adapted to 360° FOV

•  No  need  for  bundle  adjustment 

•  Guided  sampling  from  entire  FOV  for  RANSAC 


Tasks 


Mapping 

Localization 


Representations 


•  Point  features  tracking 

•  Recovery  of  relative  motion,  visual  odometry 

•  Loop  closure 

•  Environment  models,  sparse  clouds  of  points 

•  Often  sufficient  for  navigation 

•  Not  sufficient  for  navigation  free  space/obstacles 

•  Location  recognition 

•  Data assoc. with large variations of appearance

•  Semantic information for object recognition, human
robot interaction

•  Obtain high quality denser env. models

•  Associate useful semantic inform. with regions/
volumes in 3D space


3D  geometry,  Features,  Semantics 

3D  geometry:  move  from  sparse  feature  points  to 
dense  models,  reason  about  surfaces  etc. 

Overcome challenging matching and correspondence
problems

Semantic Categorization: learning and inference in a
multi-view/recursive setting


Multi-view reconstruction using super-pixels
Task Dependent Semantic Parsing


Semantic  Hierarchy 

Strongly  depends  on  the  current  task 


Specific Obj.
(traffic signs,
garbage bins, ...)


Mapping 


Semantic  Segmentation 
“Background”  /  Objects 


People, Dynamic Obj.
(cars, bikes, ...)


Structural  Obj. 
(doors,  ...) 


Manipulation 


Human 

Interaction 


images -> 3D model
Challenges of indoors and outdoors environment
X  no  or  repetitive  texture 

X  illumination,  scale,  viewpoint  changes,  occlusions  ... 
Multi-view Superpixel Reconstruction

■  pre-segment the reference image into superpixels
(Felzenszwalb & Huttenlocher, IJCV'04)

■  large support area -> more robust measures

Goal: to find depth and normal for each superpixel, $\Pi = [\mathbf{n}^\top\ d]^\top$

Assumption: each super-pixel corresponds to a planar patch in 3D

Known camera projective matrices $P_k$

■  searching for MAP of MRF as a labelling problem

$\arg\min_{\{\mathbf{n}, d\}} \sum_{s} E_{photo} + \lambda \sum_{\{s,s'\}} E_{smooth}$

X  Intractable over all depths and normals !


Plane sweeping

Plane induced homography

$H_k(\Pi, P_k, K_{ref}) = K_k \left( R_k - \mathbf{t}_k \mathbf{n}^\top / d \right) K_{ref}^{-1}$

Known cameras ... $P_k = K_k [R_k\ \mathbf{t}_k]$
Unknown plane .... $\Pi = [\mathbf{n}^\top\ d]^\top$
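The plane-induced homography used during sweeping can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: the homography is taken in the usual form $K_k (R_k - \mathbf{t}_k \mathbf{n}^\top / d) K_{ref}^{-1}$, the plane is assumed to satisfy $\mathbf{n}^\top X + d = 0$ in the reference frame, view-$k$ coordinates are assumed to be $X_k = R_k X + \mathbf{t}_k$, and the intrinsics are assumed to be a simple pinhole model. All function names are hypothetical.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(f, cx, cy):
    """Pinhole intrinsics with focal length f and principal point (cx, cy)."""
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    """Closed-form inverse of the pinhole intrinsics above."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def plane_homography(K_k, R_k, t_k, n, d, K_ref_inv):
    """H_k = K_k (R_k - t_k n^T / d) K_ref^{-1} for one plane hypothesis."""
    M = [[R_k[i][j] - t_k[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(K_k, M), K_ref_inv)


def warp(H, u, v):
    """Map a reference pixel (u, v) into view k through the homography."""
    x = [H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3)]
    return x[0] / x[2], x[1] / x[2]


K = intrinsics(500.0, 320.0, 240.0)
K_inv = intrinsics_inv(500.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Fronto-parallel plane Z = 2 (n = (0, 0, -1), d = 2), camera k shifted along x.
H = plane_homography(K, identity, [0.1, 0.0, 0.0], [0.0, 0.0, -1.0], 2.0, K_inv)
u_k, v_k = warp(H, 420.0, 290.0)
```

Sweeping then amounts to evaluating this warp for every candidate depth `d` of a given normal and scoring the photometric agreement of the warped superpixel.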
Restriction of depths

For each plane normal sweep along depth range and
remember only depth candidates !


and optimize

$\arg\min_{V} \sum_{s} E_{photo} + \lambda_1 \sum_{s} E_{geom} + \lambda_2 \sum_{\{s,s'\}} E_{norm} + \lambda_3 \sum_{\{s,s'\}} E_{depth}$,   $V = \{[\mathbf{n}\ d]^\top\}$

over the depth candidates and dominant normals !
Plane-induced homography


one depth, one normal

Sweeping

normalize all splx projections

chromaticity vector $\mathbf{c}_k = [\mu_R\ \mu_G\ \mu_B]^\top / (\mu_R + \mu_G + \mu_B)$

compute normalized histograms
Photometric measure


■  photometric measure for each splx projection - histogram difference
and chromaticity for each reference view/view pair

$C_k(d) = (\text{histogram difference}) + \alpha \lVert \mathbf{c}_k - \mathbf{c}_{ref} \rVert^2$

■  composite photometric measure for all views

[Plot of $C(d)$ and $C_k(d)$; depth candidates]

■  For each normal store N-best minima as depth candidates in matrix
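Selecting depth candidates from a swept cost curve can be sketched as follows. This is an illustrative sketch only: the function name `n_best_minima`, the local-minimum test and the toy cost values are assumptions, and the original selection rule may differ in detail.

```python
def n_best_minima(costs, n_best):
    """Indices of the n_best lowest interior local minima of a 1-D cost curve.

    costs[i] is the composite photometric cost at the i-th swept depth.
    """
    minima = [i for i in range(1, len(costs) - 1)
              if costs[i] < costs[i - 1] and costs[i] <= costs[i + 1]]
    # Keep the local minima with the smallest cost first.
    minima.sort(key=lambda i: costs[i])
    return minima[:n_best]


depths = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
costs = [5.0, 3.0, 4.0, 1.0, 2.0, 6.0, 0.5, 7.0]
candidate_idx = n_best_minima(costs, n_best=2)
candidates = [depths[i] for i in candidate_idx]
```

Keeping only a handful of candidates per normal is what makes the later labelling problem tractable, since each superpixel then chooses among a short list of plane hypotheses instead of all depths.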


Labelling  Problem  -  Energy  terms 


•  Nodes  of  the  graph  -  superpixels 

•  Edges  of  the  graph  -  induced  by  neighboring  superpixels 

•  Typical  pair  wise  MRF 


$\arg\min \sum_{s} E(s) + \sum_{\{s,s'\}} E(s, s')$


•  In our case:

$\arg\min_{V} \sum_{s} E(s, l_s) + \sum_{\{s,s'\}} E(s, s', l_s, l_{s'})$


Labeling

$V$ = vector $|S| \times 1$

Labeling: assign a label to each splx,
i.e. a plane hypothesis (normal, ...)

available solvers: Kolmogorov PAMI'06, Werner PAMI'07


Photoconsistency  term 


$E_{photo}(s, l_s)$


Color histogram and
chromaticity differences
Geometric term

Splx boundaries are usually consistent with dominant directions
gradient mixture model (Coughlan & Yuille NC'03)


Pair wise normal term
■  Force neighboring spixs to have same normal


Ising  prior 
Pair wise depth term

■  Force neighboring spixs to touch in 3D

$E_{depth}(s, s', l_s, l_{s'})$: based on the 3D distance $\lVert X(x, \mathbf{n}_s, d_s) - X(x, \mathbf{n}_{s'}, d_{s'}) \rVert$ [exact expression not recoverable]

utilizing plane priors


Compute  MAP  estimate  (run  the  MRF  graph  solver  once) 


Bayesian  formulation 


depth  map 


■  $C_2(d) = C(d) - 2\sigma^2 \log p(\mathbf{n}_s)$
GMU  building


3D  model 


3D  Reconstruction  of  street  scenes 


Oxford  corridor 
using  6  images  3D  model


Multi-view  Superpixel  Reconstruction 

Superpixel 3D reconstruction of indoor and outdoor scenes

less computational complexity
alternative photometric/similarity measures
piecewise planar surfaces can be favorably handled

■  Extensions to general mostly planar scenes, real-time settings
(Gallup’09)

■  Integration  of  3D  reconstruction  with  recognition 


Associating  Semantic  Information 


Segmentation  and  categorization  of  different  partitions 
of  sensory  data  using  geometric  and  appearance  cues 


Navigation  -  free  space,  occupied  space 


Localization/Place  recognition  -  static  portions  of  environment 


Object  search,  human  detection  -  generate  hypotheses  about 
presence  of  objects 

Object  Categorization 
Object  Instance  recognition 


Semantic  Segmentation 

1.  Task  Dependent  Semantic  Hierarchy 

2.  Multi-view and Recursive formulation

3.  Capability of handling missing 3D data (full sensor’s
FOVs coverage)


Proposed  Semantic  Hierarchy 

•  Coarse  to  fine  manner 


_ 

Human 

Interaction 


Proposed  Semantic  Hierarchy 

“Background”  and  object  categories 


•  Non-object  categories 
specific  to  types  of 
scenes 

•  we  can  assume  to  be 
present  (almost)  always 

•  mostly  static  or  slow 
changing 

•  Props/Objects 


KITTI  dataset,  Geiger  et  al.  IJRR 


Ground 

Buildings 

Vegetation 

Sky 

Objects 


Mapped  from  Sengupta  et  al.  ICRA 
2013 


NYU V2 dataset, Silberman et al. ECCV '12


Ground 

Structure 

Furniture 

Props 


Our  formulation 


Semantic 

Segmentation 


Conditional  Random  Fields 


MAP

Marginals


Preprocessing


Graph  Structure
Potentials


Inference


Learning 


Preprocessing:  Over-segmentation 


SLIC  superpixels 


Preserve  contours  -  Regularity  -  Efficiency 

R.  Achanta  et  al.  SLIC  superpixels  compared  to 
state-of-the-art  superpixel  methods,  PAMI,  2012. 


Graph  Structure:  Our  choice 


Minimum  Spanning  Tree 
Over  3D 


Edges are determined by the MST over 3D distances between
superpixels’ centroids
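The graph construction above can be sketched with Prim's algorithm on the complete graph of Euclidean distances between centroids. This is a minimal, quadratic-time illustration; the function name `mst_edges` and the toy centroids are assumptions, and the original implementation may use a different MST routine.

```python
import math


def mst_edges(centroids):
    """Edges (parent, child) of the minimum spanning tree over 3D centroids."""
    n = len(centroids)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best_dist[0] = 0.0           # grow the tree from superpixel 0
    edges = []
    for _ in range(n):
        # Pick the not-yet-connected superpixel closest to the current tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax the connection cost of the remaining superpixels.
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(centroids[u], centroids[v])
                if dist < best_dist[v]:
                    best_dist[v] = dist
                    parent[v] = u
    return edges


# Three superpixels on a nearby surface and one far away in depth.
centroids = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.0, 1.0), (0.0, 0.0, 5.0)]
edges = mst_edges(centroids)
```

The result always has one edge fewer than the number of superpixels and contains no cycles, which is what allows exact inference by belief propagation on the resulting graph.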


Graph  Structure:  Our  choice 


Formulation  as  Pairwise  CRFs 


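Directly models $p(\mathbf{x} \mid \mathbf{z})$


Potentials


Pairwise CRFs

$p(\mathbf{x} \mid \mathbf{z}) = \frac{1}{Z(\mathbf{z})} \exp\Big( \sum_{i \in \mathcal{N}} \mathbf{w}_u^\top f(x_i, \mathbf{z}) + \sum_{(i,j) \in \mathcal{E}} \mathbf{w}_p^\top g(x_i, x_j, \mathbf{z}) \Big)$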


Potentials:  Pairwise  CRFs 


Potentials:  pairwise 


Favor  (penalize)  same  class  for  nodes  close  (far)  in  Lab  color 
Favor  (penalize)  different  classes  for  nodes  far  (close)  in  Lab 


C  :  Lab  color 


Same  form  with  3D  positions 


Potentials:  unary 

unary  (local)  potential 
using  a  k-NN  classifier 


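■  $f(x_i, \mathbf{z}) = -\log P_i(x_i \mid \mathbf{z})$

$P_i(x_i = l_j \mid \mathbf{z}) \propto f(l_j) / F(l_j)$

$f(l_j)$: frequency of label $j$ in a k-NN query
$F(l_j)$: frequency of label $j$ in the database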


The  database  is  a  kd-tree  of  features  from  training  data 
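A minimal sketch of such a k-NN unary term is given below. It is not the original implementation: a brute-force neighbour search stands in for the kd-tree, the smoothing constant `eps` is an added assumption to avoid taking the logarithm of zero, and the function name `knn_unary` and the toy features are hypothetical.

```python
import math
from collections import Counter


def knn_unary(query, train_feats, train_labels, num_labels, k=3, eps=1e-6):
    """Return -log P_i(l_j | z) for every label j of one superpixel.

    P_i is proportional to (frequency of label j among the k nearest
    training features) / (frequency of label j in the whole database).
    """
    # Brute-force neighbour search; a kd-tree would be used for speed.
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(query, train_feats[i]))
    in_query = Counter(train_labels[i] for i in order[:k])
    in_db = Counter(train_labels)
    # Ratio of query count to database count, smoothed by eps (assumption).
    ratios = [(in_query[j] + eps) / (in_db[j] + eps) for j in range(num_labels)]
    total = sum(ratios)
    return [-math.log(r / total) for r in ratios]


train_feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1, 1]
unary = knn_unary((0.05, 0.05), train_feats, train_labels, num_labels=2)
```

Dividing by the database frequency compensates for class imbalance: a frequent label has to dominate the neighbourhood more strongly than a rare one to receive the same probability.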


Features

Feature                                 Indoors (15D)   Outdoors (21D)
From Image
  Lab-color: mean and std               6D              6D
  RGB-color: mean and std               -               6D
  vertical centroid location            1D              1D
  entropy from vanishing points         1D              -
From 3D
  3-D centroid position                 2D              3D
  differences on depth: mean and std    2D              2D
  local planarity                       1D              1D
  neighboring planarity                 1D              1D
  vertical orientation                  1D              1D

Features 

•  From  3D 

•  mean  and  std  of  differences 
on  depth 

•  local  planarity 

•  neighboring  planarity 

•  vertical  orientation 


•  From  Image: 

•  entropy  from  vanishing  points 


$H_s = -\sum_{j=1}^{4} hg_s(y = j) \log\left( hg_s(y = j) \right)$
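The entropy feature is the Shannon entropy of a normalized four-bin histogram per superpixel. A minimal sketch follows; the function name `histogram_entropy` is hypothetical, and what the four bins count (presumably the assignment of gradients to vanishing directions) is an assumption not spelled out in the slides.

```python
import math


def histogram_entropy(hist):
    """Shannon entropy (natural log) of a histogram after normalization."""
    total = sum(hist)
    probs = [h / total for h in hist]
    # Empty bins contribute nothing, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0)


peaked = histogram_entropy([12, 0, 0, 0])   # one dominant direction
spread = histogram_entropy([3, 3, 3, 3])    # no dominant direction
```

A low value indicates that one bin dominates inside the superpixel, while a value close to log 4 indicates that the bins are evenly populated.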


Inference 

•  We use belief propagation:

•  Exact results in MAP/marginals

•  Efficient computation, in $O(nm^2)$

Learning

•  Maximum Likelihood Estimation

•  To learn $[\mathbf{w}_u, \mathbf{w}_p]$

Tree  graph  structure 
good  convergence 


Results:  NYU-Depth  v2  Dataset 


Qualitative 

comparison: 


GroundTruth  MAP 


Ground 

Structure 

Furniture 


Props 


Results:  NYU-Depth  v2  Dataset 

Quantitative  comparison: 

•  Recall  accuracy  in  pixel-wise  percentage: 
Method                  Ground  Furniture  Props  Structure  Average  Global
CRF-MST-kNN             88.4    64.1       30.5   78.6       65.4     67.2
only Image Feat.        63.2    47.5       24.5   73.6       52.2     56.1
only 3D Feat.           89.5    70.0       16.9   79.4       62.7     65.8
data-term (kNN)         87.3    60.6       33.7   74.8       64.1     64.9
Silberman et al. 2012   68      70         42     59         59.6     58.6
Couprie et al. 2013     87.3    45.3       35.5   86.1       63.5     64.5

Results:  KITTI  Dataset 


•  Recall  accuracy  in  pixel-wise  percentage: 


Method             Ground  Objects  Building  Vegetation  Average  Global
CRF-MST-kNN        97.3    82.9     82.8      86.9        87.5     88.4
only Image Feat.   96.8    49.2     64.6      95.5        76.5     76.8
only 3D Feat.      95.9    84.2     80.5      46.7        76.8     78.8
data-term (kNN)    96.8    75.9     80.7      77.6        82.8     83.5
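The per-class, average and global figures in such a table can all be derived from a confusion matrix. The sketch below shows the usual definitions; the function name `recall_metrics` and the toy counts are illustrative, and every class is assumed to appear at least once in the ground truth.

```python
def recall_metrics(confusion):
    """Per-class recall, average (class-mean) and global (pixel-wise) accuracy.

    confusion[t][p] = number of pixels of true class t predicted as class p.
    All values are returned as percentages.
    """
    per_class = [100.0 * row[c] / sum(row) for c, row in enumerate(confusion)]
    average = sum(per_class) / len(per_class)
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return per_class, average, 100.0 * correct / total


# A frequent class recognized well and a rare class recognized poorly.
confusion = [[90, 10],
             [5, 5]]
per_class, average, global_acc = recall_metrics(confusion)
```

The average weights every class equally while the global accuracy weights every pixel equally, so a rare, poorly recognized class lowers the average far more than the global figure.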

Results:  KITTI  Dataset 


Ground  Truth  MAP  results 


Ground  Buildings  Vegetation 
Objects 


Overview 


2.  Multi-view  and  Recursive  formulation 


Multi-view: 




Different  views/nodes  in  their 
local  reference  frame  -  mean 
3D  position 


Multi-view: 


Use  relative  pose  to 
align  the  nodes  to  the 
same  reference  frame 


Multi-view: 


MST  in  the  common 
reference  frame 

No  need  to  find  hard 
correspondences 


$p(\mathbf{x}_1, \mathbf{x}_2 \mid \mathbf{z}_1, \mathbf{z}_2, T_{12})$


Recursive  Inference 


e.g.  visual 
odometry 


Semantic 

Segmentation 


Video  sequences: 


On-line  operation 

Infer the marginals at k-1

Recursive Inference


Marginals at k-1


Sensing at time k


Recursive Inference


Infer marginals at time k


$p(\mathbf{x}_k \mid \mathbf{x}_{k-1}, \mathbf{z}_k, T_k)$


Recursive  Inference 


•  Now,  the  inference  runs  over  a  forest. 


F1, KITTI Dataset

Method                Ground   Objects  Building  Vegetation  Time [?]  Time BP
Single View           0.977    0.854    0.870     0.811       21 ms     164 ms
Recursive Inference   [lost]   0.853    0.879     0.809       57 ms     69 ms

Single  view  vs.  Recursive  mode 


Original video

Single view mode

Recursive mode

Test in Dynamic Street - KITTI


Removing the object class
from the reprojected
pointcloud:


Overview 


1.  Task  Dependent  Semantic  Hierarchy 


Multiple  Sensor  Modalities 


Different  fields  of  view,  missing  3D 
data 

Every  sensor  suffers  from  specific 
blind  spots,  e.g. 

•  Laser:  limited  range,  specular 
surfaces 

•  Vision:  low  light  conditions 


•  Depth  (Kinect):  natural  light, 
specularities 

Every  modality  suffers  from 
different  sources  of  ambiguities 


Multiple  Sensor  Modalities 

•  So  far,  we  have  dealt  with  the  ‘semantic’ 
ambiguities  fusing  image  and  3D  sensors 


•  But  still,  only  over  the  common  spatial 
coverage,  and  without  handling  missing  data 


Full  Coverage  -  Formulation 

•  Define  a  graph  structure  for  full  coverage 


Full  Coverage  -  Formulation 

•  Define  a  graph  structure  for  full  coverage 

•  Domain  based  potentials 


Graph  Structure 

•  start  with  the  MST  over  3D  distances  as  before 


•  Look  for  the  image  graph  on  superpixels  without  3D 


•  And  compute  the  MST  over  Lab-color  distances 


Graph  Structure 


Sub-graph  over  3D 


Sub-graph  over  Image 


Results  in  a  graph  for  full  coverage 


CRFs  Formulation 


p(x|z) 
Unary  Potentials

Inference  and  Learning


This  graph  is  not  a  tree 

We  use  Loopy  Belief  Propagation  for  Inference 

Our  experiments  always  have  converged  in  less  than  5 
iterations 


Learning  with  MPL 


Results:  KITTI  Dataset 


Results:  KITTI  Dataset 


•  Recall accuracy in pixel-wise percentage:

[Table only partially recoverable. Surviving column labels: building, vegetation, sky, Coverage. Surviving rows and values:]

Image only
Sengupta et al. [21]   93.9  48.5  49.3  -  -  ...  88.4  -
CRF-Im∩3D [3]          ...  82.8  86.9  ...  88.4  60.1
CRF-Im∪3D              96.6  83.6  86.1  ...

Best performance in average and global
accuracies


Results:  NYU  Depth  V2  Dataset 


Im and 3D    Im or 3D

Results:  NYU  Depth  V2  Dataset


•  Recall  accuracy  in  pixel-wise  percentage: 
Overview


1.  Proposed  Semantic  Hierarchy 

2.  Basic  Formulation:  Graph,  Appearance  and 
3D  data 

3.  Multiview  and  Recursive  formulation 

4.  Full  sensor’s  FOVs  coverage 


5.  Conclusions 
Timing

Semantic

Segmentation 


Semantic  Segmentation 
for  Robotic  Systems 


Single  view  and  Video 


Semantic 

Segmentation 


Semantic  Segmentation 
for  Robotic  Systems 


Feedback  of  priming 
classes 


Semantic  Segmentation 
for  Robotic  Systems 


3D  reconstruction 
localization  — 


Semantic - 

Segmentation 


Specific 

detectors 


Finer  classes 


Semantic  Refinement 


•  Refinement of the object category


Binary  object-of-interest  vs  background  segmentation  task 


Endow  superpixels  with  richer  features 


Learned  one-vs-all  AdaBoost  per  object. 
Equal  proportion  of  positive-negatives  ex. 


Negative mining to select samples from other objects that
co-occur with the object of interest


Object  Level  Segmentation  Jaccard  Index 
Conclusions


Computationally efficient approach for semantic
segmentation

We see it as the first stage of a scalable semantic
understanding for mobile robots

Our approach effectively uses 3D and image cues

Both  3D  reconstruction  and  semantic  segmentation 
formulated  on  the  same  graph  induced  by  superpixels 

We  exploited  the  versatility  and  flexibility  of  CRFs  to 
connect  and  use  different  sensory  modalities  for  full 
coverage 
Semantic  Parsing  of  Open  Environments

Simultaneous  segmentation  and  categorization  of  partitions 
of  sensory  data  into  background  and  object  categories 
■  tail light
■  parking meter
■  door
■  fence
■  column
■  wall
■  sign
■  windshield


Semantic  Parsing  of  Open  Environments 

Endowing  environment  models  with  semantic  information 
can  enhance  robustness  and  sophistication  of  robotic  tasks 


In  Computer  Vision,  often  in 
single  view  setting,  expensive 
preprocessing,  learning  and 
inference 

Robotics  requires  processing 
of  video  and  multiple  sensor 
modalities 

Semantic categories can be
constrained by the tasks

Incremental  re-usable 
representations 


■  car 

■  window 

■  wheel 

■  building 

■  road 

■  sky 

■  tree 

■  sidewalk 

■  tail  light 

■  parking  meter 

■  headlight 

■  door 

■  fence 

■  column 

■  wall 

■  sign

■  windshield


Semantic  Segmentation 

Task  constraints 

•  Navigation  (road,  free  space) 

•  Localization  (landmarks) 

•  Manipulation  (object 
categorization,  object  search) 

•  Reasoning  about  static  and 
dynamic  object  instances 


Ground 

Truth 


Semantic  Concepts  &  Tasks 


Semantic  Hierarchy 

“Background”  and  object  categories 


•  Non-object  categories 
specific  to  types  of 
scenes 

•  we  can  assume  to  be 
present  (almost)  always 

•  mostly  static  or  slow 
changing 


KITTI dataset, Geiger et al. IJRR 2013


Ground 

Buildings 

Vegetation 

Sky 

Objects 


Mapped  from  Sengupta  et  al.  ICRA  2013 


•  Generic  objects  share 
some  characteristics 


Furniture 

Props 


Different  time  scales 


Approach 


MAP 

Marginals 

MAP 


Conditional  Random  Fields 


Preprocessing 


Graph  Structure 


Inference 


Potentials 


Learning

Previous methods suffer from: expensive over-segmentation, expensive features, expensive inference


Graph  Structure: 


Minimum  Spanning  Tree  Over  3D 


•  SLIC  superpixels,  preserve  contours,  regularity,  efficiency 

•  Edges  are  determined  by  the  MST  over  3D  distances 
between  superpixels’  centroids 


Graph  Structure:  Our  choice  MST  over  3D 


Intra-class  components  are  naturally 
connected 


Potentials:  Pairwise  CRFs 


MAP  inference  in  CRF: 

compute  most  likely  labels  x  given  observations  z 


Potentials:  unary  &  pairwise 


unary  (local)  potential 
using  a  k-NN  classifier 

■  $f(x_i, \mathbf{z}) = -\log P_i(x_i \mid \mathbf{z})$

$P_i(x_i = l_j \mid \mathbf{z}) \propto f(l_j) / F(l_j)$

$f(l_j)$: frequency of label $j$ in a k-NN query
$F(l_j)$: frequency of label $j$ in the database


Features 


indoors  (15D) 


•  From  3D 

•  mean  and  std  of  differences 
on  depth 

•  local  planarity 

•  neighboring  planarity 

•  vertical  orientation 

•  From  Image: 

•  entropy  from  vanishing  points 


outdoors (6D)


$s = -\sum_{i=1}^{4} h(y = i) \log h(y = i)$
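The entropy feature can be illustrated with a short sketch, assuming h is a normalized histogram over four bins. What the bins represent is not recoverable from the slide, so the counts below are purely hypothetical.

```python
import math

def histogram_entropy(counts):
    """Shannon entropy (natural log) of a histogram given as raw counts."""
    total = sum(counts)
    # Empty bins contribute nothing to the sum, so they are skipped.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

flat = histogram_entropy([5, 5, 5, 5])      # evenly spread histogram
peaked = histogram_entropy([20, 0, 0, 0])   # all mass in a single bin
```

The entropy is largest for the evenly spread histogram and zero when all the mass falls in one bin.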


Inference 

•  We  use  belief  propagation: 

•  Exact  results  in  MAP/marginals 

•  Efficient computation, in O(nm^2)
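Exact MAP inference on a tree can be sketched with min-sum belief propagation: one pass from the leaves to the root and one pass back. This is an illustrative sketch rather than the authors' implementation; the function name, the Potts pairwise cost, and the toy costs are assumptions. With n nodes and m labels, each edge is processed once per parent label and child label, so the work grows as n times m squared.

```python
def map_on_tree(unary, edges, pairwise):
    """Exact MAP labelling on a connected tree by min-sum belief propagation.

    unary:    list of per-node cost lists, unary[i][a]
    edges:    list of (i, j) pairs forming a tree over nodes 0..n-1
    pairwise: symmetric function (a, b) -> cost for neighbouring labels
    """
    n, m = len(unary), len(unary[0])
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    # Root the tree at node 0 and record a parent-before-child ordering.
    parent, order, stack = [-1] * n, [], [0]
    seen = [False] * n
    seen[0] = True
    while stack:
        u = stack.pop()
        order.append(u)
        for v in nbrs[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                stack.append(v)

    # Upward pass: cost[i][a] becomes the best cost of the subtree at i with x_i = a.
    cost = [list(u) for u in unary]
    argbest = [[0] * m for _ in range(n)]  # best child label for each parent label
    for c in reversed(order[1:]):
        p = parent[c]
        for a in range(m):
            b = min(range(m), key=lambda b: cost[c][b] + pairwise(a, b))
            argbest[c][a] = b
            cost[p][a] += cost[c][b] + pairwise(a, b)

    # Downward pass: fix the root, then read off each child's best label.
    labels = [0] * n
    labels[0] = min(range(m), key=lambda a: cost[0][a])
    for c in order[1:]:
        labels[c] = argbest[c][labels[parent[c]]]
    return labels

# Toy chain of three nodes with two labels and a Potts smoothness cost.
unary = [[0.1, 2.0], [1.0, 0.9], [0.2, 1.5]]
potts = lambda a, b: 0.0 if a == b else 0.5
labels = map_on_tree(unary, [(0, 1), (1, 2)], potts)
```

Taken alone, the middle node slightly prefers label 1, but the pairwise term makes the joint optimum assign label 0 to all three nodes.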

Learning 

•  Maximum  Likelihood  Estimation 


•  To learn the weights [w_u, w_p]


Tree graph structure: good convergence


Results:  NYU-Depth  v2  Dataset 


Qualitative 

comparison: 


GroundTruth  MAP  results 


Ground 

Structure 

Furniture 

Props 


Results:  NYU-Depth  v2  Dataset 

Quantitative  comparison: 

•  Recall  accuracy  in  pixel-wise  percentage: 
only 3D Feat.            89.5   70.0   16.9   79.4   62.7   65.8
data-term (kNN)          87.3   60.6   33.7   74.8   64.1   64.9
Silberman et al. 2012    68     70     42     59     59.6   58.6
Couprie et al. 2013      87.3   45.3   35.5   86.1   63.5   64.5

Results:  Independent  datasets 


B3DO: Berkeley 3-D Object Dataset
Janoch  and  Karayev,  2011 
http://kinectdata.com/ 


RGB  Depth  MAP 

results 


RGB-D  Object  Dataset 

K. Lai, L. Bo, X. Ren and D. Fox, 2011

http://www.cs.washington.edu/rgbd-dataset/


RGB  Depth  MAP 

results 


Generating  Object  Hypotheses 

•  Knowledge of the size of the object to be manipulated is used to prime object detection

•  Object hypotheses are generated from the probability map
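One plausible reading of this step is sketched below: threshold a per-pixel object probability map, group the surviving pixels into connected regions, and keep only regions whose size is compatible with the expected object. The threshold, the area bounds in pixels, and the function name are assumptions for illustration; the slides do not specify how hypotheses are extracted or how metric object size maps to image area.

```python
from collections import deque

def object_hypotheses(prob, threshold=0.5, min_area=2, max_area=50):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions
    whose probability is at least `threshold` and whose area is in range."""
    rows, cols = len(prob), len(prob[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or prob[r][c] < threshold:
                continue
            # Breadth-first flood fill of one connected region.
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and prob[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Size prior: discard regions that are too small or too large.
            if min_area <= len(cells) <= max_area:
                ys = [y for y, _ in cells]
                xs = [x for _, x in cells]
                boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes

# Hypothetical probability map with one 2x2 blob and one isolated pixel.
prob_map = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.1, 0.1],
    [0.1, 0.7, 0.9, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]
boxes = object_hypotheses(prob_map)
```

The isolated pixel is rejected by the minimum-area bound, which is how the size prior prunes implausible hypotheses in this sketch.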


Results:  KITTI  Dataset 


Ground  Buildings 

Vegetation  Objects 


Results:  KITTI  Dataset 


•  Recall accuracy in pixel-wise percentage:

                 Ground   Objects   Building   Vegetation   Average   Global
CRF-MST-kNN      97.3     82.9      82.8       86.9         87.5      88.4
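The relation between the per-class recall values and the two summary columns can be sketched as follows, assuming "Average" is the mean of the per-class recalls and "Global" is the overall fraction of correctly labelled pixels. The first assumption agrees with the CRF-MST-kNN row, where (97.3 + 82.9 + 82.8 + 86.9) / 4 = 87.475. The confusion matrix below is made up for illustration.

```python
def recall_summary(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p.

    Returns (per-class recall, mean of per-class recalls, global accuracy),
    all expressed as percentages.
    """
    per_class = [100.0 * row[t] / sum(row) for t, row in enumerate(confusion)]
    average = sum(per_class) / len(per_class)
    correct = sum(confusion[t][t] for t in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return per_class, average, 100.0 * correct / total

# Hypothetical two-class example with a large class and a small class.
per_class, average, global_acc = recall_summary([[90, 10], [5, 5]])
```

The global figure is dominated by the large class, while the average weights every class equally, so the two can differ noticeably on unbalanced data.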

only Image Feat.

95.9  84.2  80.5  46.7  76.8

data-term (kNN)

Results: NYU-Depth v2 Dataset

Computational Cost (s)


Initial implementation in C++ with SLIC on the GPU: 5 fps


Multiple  Sensor  Modalities 


Different  fields  of  view 

Every  sensor  suffers  from  specific 
blind  spots,  e.g. 


•  Laser:  limited  range,  specular 
surfaces 

•  Vision:  low  light  conditions 

•  Depth  (Kinect):  natural  light, 
specularities 

•  Every  modality  suffers  from 
different  sources  of  ambiguities 


Multiple  Sensor  Modalities 


Previously: ‘semantic’ ambiguities were addressed by fusing image and 3D sensor data, restricted to the common spatial coverage and without handling missing data


Extension  to  union  of  FOV’s 


Graph  Structure 

•  Sub-graph  over  3D 


•  Sub-graph  over  Image 


•  Results  in  a  graph  for  full  coverage 


Recursive  Inference 


e.g.  visual 
odometry 


Semantic 

Segmentation 


Video  sequences: 


On-line  operation 


Infer the marginals at k-1

Recursive Inference


$p(x_{k-1} \mid z_{k-1})$


Marginals at k-1


Sensing  at  time  k 


Recursive  Inference 


Infer marginals at time k
$p(x_k \mid x_{k-1}, z_k, T_k)$
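The recursive idea can be illustrated per node with a simple Bayes-style update, assuming the marginals from the previous frame have already been associated with the current node (for example through the estimated motion) and act as a prior for the current observation. This is a simplification rather than the authors' formulation: the `forget` factor and the function name are hypothetical, and in a full system the prior would enter the CRF at time k instead of being applied independently per node.

```python
def recursive_update(prev_marginal, likelihood, forget=0.1):
    """Combine the previous marginal (as a prior) with the current likelihood.

    prev_marginal: label probabilities inferred at time k-1 for this node
    likelihood:    per-label likelihood from the sensing at time k
    forget:        fraction of the prior replaced by a uniform distribution,
                   so that old evidence never fully blocks new evidence
    """
    m = len(prev_marginal)
    prior = [(1.0 - forget) * p + forget / m for p in prev_marginal]
    posterior = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]

# Hypothetical node: confident about label 0 at k-1, ambiguous sensing at k.
posterior = recursive_update([0.7, 0.2, 0.1], [0.3, 0.6, 0.1])
```

Here the current observation alone would favour label 1, but the carried-over belief keeps label 0 as the most probable label, which is the temporal smoothing effect the recursive setting aims for.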


Recursive  Inference 


F1 scores, KITTI Dataset

                      Ground   Objects   Building   Vegetation   Time MST   Time BP
Single View           0.977    0.854     0.870      0.811        21 ms      164 ms
Recursive Inference   0.977    0.853     0.879      0.809        57 ms      69 ms

Finer  Grained  Categorization 


•  We formulate the recognition and segmentation of objects in indoor scenes as a binary object-of-interest vs. background segmentation task, and learn one binary segmentation per category

•  Our  choices: 

-  Enrich  set  of  features 


-  Low  level  per  category  grouping  rules 
are  learned  in  CRF  setting 


dresser  floor  mat  lamp  mirror  night  stand  paper  person 
objects 


Fine  grained  categorization 


•  Extend  the  set  of  features: 


-  Color and texture histograms C1, T1

-  Geometric  Features  (previous)  G1 


-  Generic  Features  G2  (planarity  features,  alignment  with  respect  to 
gravity,  orientation  context) 


object  co-occurrence 


Descriptor type   Description
C1                75-dim. histogram of HSV values
T1                240-dim. histogram
G1                11-dim. descriptor of geometric features
G2                60-dim. descriptor of generic features

•  AdaBoost classifier with decision trees


•  Exploit hard negative mining using context

•  Sample negative examples in proportion to their co-occurrence with the target category
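Sampling negatives in proportion to co-occurrence can be sketched with weighted random sampling. The data layout, the co-occurrence values, and the function name are assumptions for illustration; the slides do not state how co-occurrence is estimated or whether sampling is done with replacement (it is here).

```python
import random

def sample_negatives(candidates, cooccurrence, n, seed=0):
    """Draw n negative examples, weighted by how often each candidate's
    category co-occurs with the target category.

    candidates:   list of (sample_id, category) pairs
    cooccurrence: {category: co-occurrence weight with the target category}
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    weights = [cooccurrence.get(cat, 0.0) for _, cat in candidates]
    # random.choices samples with replacement and needs a positive total weight.
    return rng.choices(candidates, weights=weights, k=n)

# Hypothetical negatives for a "bookshelf" classifier.
candidates = [("s1", "books"), ("s2", "wall"), ("s3", "floor")]
cooccurrence = {"books": 0.6, "wall": 0.4, "floor": 0.0}
negatives = sample_negatives(candidates, cooccurrence, n=20)
```

Categories that often appear next to the target are drawn more frequently, so the classifier sees more of the confusable context, while categories that never co-occur are not drawn at all.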


bookshelf co-occurrence


Finer  grained  categorization 


•  Object  recognition  and  segmentation 

•  Train per-class object vs. background CRF models


Evaluation in terms of per-class segmentation accuracy, using the Jaccard index

•  Improves state-of-the-art performance and is very efficient (the computational bottleneck is feature computation)
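For reference, the Jaccard index of a binary segmentation is the size of the intersection of predicted and ground-truth object pixels divided by the size of their union. A minimal sketch on flattened binary masks follows; the masks are made-up examples, and returning 1.0 when both masks are empty is a convention chosen here.

```python
def jaccard_index(pred, truth):
    """Intersection over union of two equally long binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Convention (assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0

pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
score = jaccard_index(pred, truth)
```

Unlike pixel accuracy, this score ignores the background pixels that both masks agree on, so it stays informative for small objects.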


More  detailed  per  category  results 
Conclusions


Computationally  efficient  approach  for  semantic 
segmentation,  effective  use  of  image  and  3D  cues 

Proposed  Semantic  Hierarchy:  Background/Objects 

CRF  Framework,  Efficient  exact  inference  on  trees  in  3D 

Recursive  setting  and  multiple  sensing  modalities 


Refining  Semantic  Hierarchy  for  Objects 


Life-long  Semantic  Mapping 

•  Reusable  Representations  of  sensory  streams, 
which  will  generalize  across  different  environments 

•  New  semantic  concepts  can  be  learned 
incrementally,  fine  grained  semantic  categories 

•  Tightly  couple  localization,  reconstruction,  mapping 


